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Abstract. The Lempel-Ziv parsing of a string (LZ77 for short) is one of the most important and 
widely-used algorithmic tools in data compression and string processing. We show that the Lempel- 
Ziv parsing of a string of length n on an alphabet of size a can be computed in 0(n log log cr) time 
(O(n) time if we allow randomization) using 0(n log cr) bits of working space; that is, using space 
proportional to that of the input string in bits. The previous fastest algorithm using 0(n log cr) space 
takes 0(n(log<7 + log log n)) time. We also consider the important rightmost variant of the problem, 
where the goal is to associate with each phrase of the parsing its most recent occurrence in the input 
string. We solve this problem in 0(n( 1 + (log cr/^/log n)) time, using the same working space as above. 
The previous best solution for rightmost parsing uses 0(n(l + log cr/log log n)) time and 0(n log n) 
space. As a bonus, in our solution for rightmost parsing we provide a faster construction method for 
efficient 2D orthogonal range reporting , which is of independent interest. 
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1 Introduction 


For almost four decades the LZ parsing [ 45] (or LZ77) has been a mainstay of data compression, 
and is widely used today in several popular compression tools (such as gzip and 7 zip), as a part of 
larger storage and search systems [3llfil32| . and as a basic measure of information content Em 
In addition to its long history in compression, LZ77 also has a wealth of applications in string 
processing. The factorization reveals much of the repetitive structure of the input string and this 
can be exploited to design efficient algorithms and data structures. For example, optimal algorithms 
for computing all the tandem repetitions [28] and seeds m in a string, rely on LZ77. More recently, 
LZ77 has become a basis for pattern matching indexes [19130] and compressed data structures [6]. 

Because its computation is a time-space bottleneck in these and other useful and interesting 
applications, efficient algorithms for LZ77 factorization have been the focus of intense research 
almost since its discovery (see P25] for recent surveys). 

The LZ parsing breaks a string S up into z phrases (factors). The phrase starting at position i 
is either (a) the first occurrence of letter S[i] in S, or (b) the longest substring starting at position i 
that has at least one occurrence starting to the left of position i in S. For example, the LZ parsing 
of string S = araarraaa is a|r|a|ar|raa|a. Compression can be achieved by replacing phrases of 
type (b) with a pair of integers ( pi , ti) that indicate respectively the starting position and length of 
a previous occurrence of the phrase in S. For example, the fifth phrase raa would be represented 
with (1,3) because substring 5[1,1 + 3 — 1] = S[l,3] = raa is a prior occurrence. 

It is important to note that there is sometimes more than one previous occurrence of a phrase, 
leading to a choice of pi value. If pt < i is the largest possible for every phrase then we call the parsing 
rightmost. In their study on the bit-complexity of LZ compression, Ferragina et al. m showed that 
the rightmost parsing can lead to encodings asymptotically smaller than what is achievable with 
other choices of pi. 

Main results. This article elucidates an important part of the asymptotic landscape of LZ factor¬ 
ization algorithms. In particular: 

1. We describe an algorithm that computes the LZ factorization in 0(n log log a) time ( 0(n ) time if 
randomization is allowed) using only 0(n logo - ) bits of working space. This is the first algorithm 
for LZ parsing that uses compact spacqj and o(n(logcr + log log n)) time. 

2. Our initial approach does not provide any guarantees on the pi value it computes (other than 
that it is less than i, of course). In Section [6] we consider the problem of ensuring the pi 
value computed is the rightmost possible (computing the rightmost parsing). For this problem 
we describe an algorithm using 0(n logo - ) bits of space and 0(n(loglogcr + log c/ylogn)) 
(0(n(l + logcr/vdogn)) if we allow randomization) time. This significantly improves on the 
algorithm of Ferragina et al. [15] . which runs in 0(n(l + log < 7 / log log n)) and anyway requires 
0(n log n) bits of space. 

3. On the way to our rightmost parsing result we provide a faster preprocessing method for 2D 
orthogonal range reporting queries — a classic problem in computational geometry. Given n 
points the data structure requires 0(n log n) bits of space, 0(log e n) query time (per reported 
point), and, critically, 0(ny/\ogn) time to construct. The data structure has the additional 
property that the reported points are returned in order of their x-coordinate, which is essential 

1 Compact space means space within a constant factor of the size of the input, excluding lower-order terms; so 
0(n log a) bits of space in our case. 
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to our rightmost parsing algorithm, and many other applications too (see, e.g., [44139] ). Our 
result is a counterpart to the 0{nyJ\ogn) construction method for 2D orthogonal range counting , 
by Chan and Patra§cu pLO] . 

Roadmap. The following section sets notation and lays down basic concepts, data structures, and 
results. We then review related work in Section [3] Section [I] describes an algorithm for LZ parsing 
in compact space that runs in 0(nlog logo - ) time (0(n) time if we allow randomization). Section [6] 
deals with the rightmost parsing. Section [a] - where we describe our result for range reporting 

queries — can be read independently of the rest of the document if so desired. Open problems are 
offered in Section 0 

2 Tools 

Our model in this paper is the word RAM. Space bounds will always be expressed as the number 
of bits used by an algorithm. We now define some of the basic tools we use throughout. 

Strings. Throughout we consider a string X = X[l..n] = X[1]X[2]... X[n] of |X| = n symbols drawn 
from the alphabet [0..er — 1], For technical convenience we assume X[n] is a special “end of string” 
symbol, $, smaller than all other symbols in the alphabet. 

The reverse of X is denoted X. For i = l,...,n we write X[i..n] to denote the suffix of X of 
length n — i + 1, that is X[i..n] = X[z]X[z + 1] ... X[n]. We will often refer to suffix X[i..n] simply 
as “suffix z”. Similarly, we write X[l..z] to denote the prefix of X of length i. X[i..j] is the substring 
X[z]X[z+l]... X[j] of X that starts at position i and ends at position j. Slightly abusing the notation, 
a string X will also refer to the integer obtained by considering the bits in decreasing significance 
in left-to-right order (X[1] is most significant bit followed by X[2] and so on). 

Suffix Arrays. The suffix array [34] SAx (or just SA when the context is clear) of a string X is an array 
SA[1. .n] which contains a permutation of the integers [l..n] such that X[SA[l]..n] < X[SA[2]..n] < 

• • • < X[SA[n]..n]. In other words, SA[j] = i iff X[z..n] is the j th suffix of X in lexicographical order. 
The inverse suffix array ISA is the inverse of SA, that is IS A[z] = j iff SA [j] = i. 

For a string Y, the Y-interval in the suffix array SAx is the interval SA[s..e] that contains all 
suffixes having Y as a prefix. The Y-interval is a representation of the occurrences of Y in X. For a 
character c and a string Y, the computation of cY-interval from Y-interval is called a left extension 
and the computation of Y-interval from Yc-interval is called a right contraction. Left contraction 
and right extension are defined symmetrically. 

BWT and backward search. The Burrows-Wheeler Transform [8j denoted L[l..n] is a permutation 
of X such that L[z] = X[SA[z] — 1] if SA[z] > 1 and $ otherwise. We also define LF[z] = j iff 
SA[j] = SA[z] — 1, except when SA[z] = 1, in which case LF[z] = ISA[n]. Clearly, if we know I = ISA[1], 
we can invert the BWT to obtain the original string right-to-left via repeated applications of LF, 
outputing L[/], then L[LF[/]], then L[LF[LF[/]]], and so on. Note that, after /, this process also visits 
the positions in SA of suffixes n, then n — 1, and so on. 

Let C[c], for symbol c, be the number of symbols in X lexicographically smaller than c. The 
function rank(X,c, i), for string X, symbol c, and integer i, returns the number of occurrences of 
c in X[l..z]. It is well known that LF[z] = C[L[z]] + rank(L, L[z], z) and that this “special” form of 
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rank (where c = X[?']) can be computed in 0(1) time after 0(n) time preprocessing to build a data 
structure of size n log a + 0(n log log a) [7]. 

Furthermore, we can compute the left extension using C and rank. If SA[.s..e] is the Y-interval, 
then SA[C[c] + rank(L, c, s), C[c] + rank(L, c, e)] is the cY-interval. This is called backward search [T3]. 
Note that backward search either requires general rank queries, which can be answered in 0(log log a) 
time after 0(n) time preprocessing to build a data structure of size nlogcr + o{n log a) [22] or can 
be solved using a more sophisticated data structure which occupies the same asymtotic space but 
requires O(n) randomized time preprocessing [7]. 

There are many BWT construction algorithms (see [41] for a survey), and some of them operate 
in compact space. In particular, Hon, Sadakane, and Sung [ 23] show how to construct the BWT 
in 0(n log logo - ) time and O (nlogcr) bits of working space. More recently, Belazzougui [5] showed 
how the BWT can be computed in 0(n) time (randomized) and 0[n log a) bits of working space. 

Wavelet Trees. Wavelet trees [21] are a tool from compressed data structures [36] that encode a 
string S on alphabet in n log a + o(n log a) bits and allow fast computation of various queries on 
the original string such as access to the ith symbol, rank, select, and various range queries [37]. The 
operation rank is as defined above, while operation select(5, c, i) for symbol c, and integer i returns 
the position of the ith occurrence of c in X. 

Let 5[l..n] be a string of n symbols, where each symbol is in the range [l..cr]. The wavelet tree 
\Ng of S is a perfect binary tree with a leaves. The leaves are labelled left-to-right with the symbols 
[l..cr] in increasing order. For a given internal node v of the tree, let s v be the subsequence of S 
consisting of only the symbols on the leaves in the subtree rooted at v. We store at v a bitvector b v 
of |s„| bits, setting b v [i\ = 1 if symbol s„[i] appears in the right tree of v, and b v [i\ = 0 otherwise. 
Note that s v is not actually stored, only b v . Clearly \N$ requires nlogcr + o(n log a) bits. 

LZ 77. Before defining the LZ77 factorization, we introduce the concept of a longest previous factor 
(LPF). The LPF at position i in string X is a pair LPFxfi] = ( Pi , if) such that, pi < i, X\pi..pi+£f) = 
X[i..i + if, and £ t is maximized. In other words, X[i..i + £ t ) is the longest prefix of X[i..n] which 
also occurs at some position Pi < i in X. 

The LZ77 factorization (or LZ77 parsing) of a string X is then just a greedy, left-to-right parsing 
of X into longest previous factors. More precisely, if the jth LZ factor (or phrase ) in the parsing is 
to start at position i, then we output ( pi,£i ) (to represent the jth phrase), and then the (j + l)th 
phrase starts at position i + fj. The exception is the case ii = 0, which happens iff X[z] is the 
leftmost occurrence of a symbol in X. In this case we output (X[z],0) (to represent X[i..i]) and the 
next phrase starts at position i + 1. When ii > 0, the substring X\pi..pi + £i) is called the source of 
phrase X[i..i + if). We denote the number of phrases in the LZ77 parsing of X by z. 

Theorem 1 (e.g., Karkkainen [24] ) . The number of phrases z in the LZ11 parsing of a string 
of n symbols on an alphabet of size a is 0(n/ log CT n) 

3 Related Work 

There have been many LZ parsing algorithms published, especially recently. Most of these results 
make no promise about the rank of the previous factor occurrence they output. The current fastest 
algorithm in practice (ignoring memory constraints) is due to Karkkainen et al. [26]. Their algorithm 
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runs in optimal O(n) time and uses 2?rlogn + n log a bits of space. Goto and Bannai [20] improve 
space usage to n log n + n log a bits, while maintaining linear runtime. 

Compact-space algorithms are due to Ohlebusch and Gog m , Kreft and Navarro m , Karkkainen 
et al. [25] . and Yamamoto et al. [23]. All these approaches take in 0(n log n) time. Very recently a 
solution with nlogu + 0(n) space and 0(n( logo - + log log n)) time has been proposed in [29]. We 
show a significant improvement — to 0(n log logo - ) time — is possible in the compact setting. If 
we allow randomization, then our time becomes linear. 

There are significantly fewer results on the rightmost problem. Early algorithmic work is due 
to Amir, Landau and Ukkonen [2], who provide an 0(n log n) time and space solution, but it 
should be noted that selection of rightmost factors had already been used as a heuristic in the 
data compression community for years (for example, in the popular gzip compressor). Recently, 
Larsson m showed how to compute the rightmost parsing online in the same 0(n log n) time and 
space bounds as Amir et al.. Currently the best prior result for rightmost parsing is an 0(n log n) 
space and Oin + n log a/ log log n) time algorithm due to Ferragina, Nitto and Venturini |T5] . We 
provide a faster algorithm that uses significantly less space. 

Our result for rightmost makes use of an improved technique for range predecessor queried 1. By 
combining text indexing with range reporting our thus work continues a long tradition in string 
processing, recently surveyed by Lewenstein [33]. Given a set of two-dimensional points P, the 
answer to an orthogonal range predecessor query Q = [a, +oo] x [c, d\ is the point p € P with 
largest y-coordinate among all points that are in the rectangle Q. 

If one is allowed 0(n log n) bits of space, a data structure due to Yu, Hon, and Wang [12] 
supports range predecessor queries in 0(log n/ log log n) time and takes 0(n log n/ log log n) time 
to construct. Navarro and Nekrich [39 ] subsequently improved query time to 0(log e n), where 
e is an arbitrarily small positive constant, however their structure has 0(n log n) construction 
time [38]. Multiary wavelet trees are also capable of answering range predecessor queries, and 
recently Munro, Nekrich, and Vitter [35] (and contemporaneously Babenko, Gawrychowski, Ko- 
ciumaka, and Starikovskaya 0) showed how to construct wavelet trees in 0{n log cr/^/logn) time 
using 0(n log n) bits of working space, and supporting queries in O (log a/ log log n) time. 

Finally, we note that a data structure for range predecessor queries immediately implies one 
for the classic and much studied 2D orthogonal range reporting problem from computational ge¬ 
ometry p], in which we seek a data structure to report all points contained in a four-sided query 
rectangle Q = [a,b\ x [c, d\. Our range predecessor result is a 0(n logn)-bit data structure with 
query time 0(log e n) (for any constant e < 1) that can be built in 0{ny/\ogn ) time. This matches 
the best known query time of for this problem when using 0{n\ogn ) bits of space, and to our 
knowledge is the first to offer construction time o(n log n/ log log n). To put our result in context, 
Chan and Patra§cu m have shown that a 2D range counting data structure with O(logn) query 
time and 0(n log n) bits of space can be built in 0(n^l og n) time. 

4 Lempel-Ziv Parsing in Compact Space 

Assume the next LZ factor starts at position i in the string. Our basic approach to compute the 
factor is to treat X[i,n] as a pattern and perform a prefix search for it in X, which we simulate via 
backward search steps on the FM-index of the reverse text Xh Consider a generic step j of this 
backward search, in which we have a range [sj,ej\ of the BWT and SA of X'. The factor beginning 

2 Elsewhere these queries are variously called range successor [391 and range next value queries | 12 |. 
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at i has length at least j if and only if SAx' [sy, e } \ contains a value pt > i. To see this, observe 
that the presence of such a p* in SAx' [sj, ej\ means there is a substring X'\jj l ..p l + j] that occurs 
after substring X[i..i + j] = X'[n — i — j..n — i] in X', which in turn implies there is an occurence of 
X[i..i + j] before position * in X (starting at position n — Pi, in fact). 

Our problem now is to be able to determine if SAx'[sj, e.,] contains a value larger than i at 
any given step in the above process. One solution is to preprocess SA for range maximum queries. 
However, this requires that we either first store SA in plain form, which requires 0(nlogn) bits, or 
that we obtain the values of SA left-to-right SA[1], SA[2],... (the order in which they are required 
for RMQ preprocessing) via repeated decoding using the SA sample, requiring 0(n log n) time. 
Either of these straightforward methods uses more space or time than we desire. 

Our approach instead then is to logically divide the BWT into equal-sized blocks of size b = 
logn/2. We then invert the BWT, and during the inversion we record for each block the maximum 
value of SA[i] that we see over the whole inversion process. We store an array, A[l,n/6] of these 
block maxima. Storing A requires nlogn/b = 2 n = 0(n ) bits. We now build the succinct RMQ 
data structure of Fischer and Heun [18j on the array A, which requires 3 n/b + o(n/b) bits (including 
during construction) and 0(n ) time. So far we have used O(n) bits and O(n) time for the inversion. 

We are now ready to describe the LZ factorization algorithm, which will involve another inver¬ 
sion of the BWT. This time we maintain a bit vector B[l,n]. If, during inversion, we visit position 
j in the BWT, then we set B[)] = 1. At the same time (i.e. while) we are inverting we will perform 
a backward search for the next LZ factor, using the as yet unprocessed portion of X as the pattern. 

Say we have factorized part of the string and we are now looking for the LZ factor that starts 
at i. We match symbols of i using backward search. At a given point in this process we have 
matched £ symbols and have an interval of the SA, say SA[s,e]. We need to decide if there exists a 
p < i € SA[s, e], which will tell us there is an occurrence of X[p..p + £] = X[i..i + £] before i. 

SA[.s,e] can be divided into at most three subranges: one that is covered by a series of block 
maxima (i.e. a subarray of A ), and at most two small subranges at each end, [s, s'] and [e', e], each 
of size at most [logn/2j. We compute the maximum value covered by the block maxima in 0(1) 
time using the RMQ data structure we built over A. For the two small subranges at each end of 
SA[s,e] that are not fully covered by block maxima we consult B. If there is a bit set in either of 
the B[s,s'] or Bfe^e] then we know there is some suffix in there that is greater than i (because it 
has already been visited in the inversion process). Because the (sub)bitvectors B[s, s'] or Bfe', e] are 
so small (< [logn/2j), we can use a lookup table to determine if there is a set bit in 0(1) time, 
and further we can have the lookup table return the position of one of these set bits. In this way 
we are able to determine in constant time whether we have reached the end of the current factor. 

Having determined the length of the current factor, it is then simply a matter of using a sampled 
SA of size 0(n/logn ) elements that allows us to extract arbitrary SA elements in O(logn) time [14] 
to obtain one of the candidate SA values from the previous round in the search for the current 
factor (so that we can output a pi value for the factor). This takes O(logn) time per factor, which 
over all the z = 0(n/ log CT n) factors takes 0(n log a). However, runtime can be further reduced to 
0{n ) over all factors if we first record the positions of the candidate pi values for each factor (using 
0(n log <r) bits of space) and obtain them all in a single further inversion of the BWT. 

As described, our factorization algorithm requires 0[n ) time and 0{n log a) bits of space in 
addition to the resources needed to construct the BWT and perform n backward search steps. We 
thus have the following theorem. 
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Theorem 2. Given a string S ofn symbols on an ordered alphabet of size o we can compute the LZ 
factorization of S using O(nlogcr) bits of space and 0{n log log a) time or 0(n ) time (randomized). 

5 Faster preprocessing for range-predecessor queries 

In the range predecessor problem in rank space, we are to preprocess n points on a [l,n] x [1 ,n] 
grid, where all points differ in both coordinates, so as to answer to the following kind of query: 
given integers x\,X 2 and y find the point (u,v) such that x\ < u < X 2 , v < y and v is maximal. 

Navarro and Nekrich presented a solution to this problem [39] that uses space 0{n log n/e) bits 
and answers queries in time 0(log e n). However, they did not show how to efficiently construct their 
data structure. 

We now show how to efficiently build a variant of the known solutions for range predecessor. 
The solution we describe here has query time O (ydog n log log n) . We later show how to generalize 
it to have query time 0(log e n) for arbitrary 0 < e < 1. 

We start by defining the sequence Y of length n over alphabet [l..n] obtained by setting Y[x\ = y 
for every point (x, y) in the set of input points. We similarly define the sequence X such that 
X[y\ = x for every input point (x,y). 

At a high-level, the solution uses a top-level data structure that resembles a multiary wavelet 
tree m with arity 2'V /logn and depth \/log n. We note that a standard wavelet tree can answer 
range predecessor queries in time O(logn). Without loss of generality, we assume that \J\ogn is 
integral and that logn is divisible by y/Xogn. The top-level data structure is a tree with \/log n 
levels. The arity of the tree is exactly 2V /lo s n . At any level i € [CLydogn — 1], we will have 2*V lo s n 

nodes labelled with values [0..2*\/ logn _ i] (note that at level 0 we will only have the root node, 
which is labelled by 0). Any node a at level i will have as children all nodes /3 at level i + 1 such 
that /3[l.i\/log n\ = a. 

To every node a in the tree we associate a sequence Y a of length n a . The sequence Y a will be 
over alphabet [1..2^ logn ], while the sequence Y' a is a sequence of integers from [1..2^V /logn_ ^''/ 1 °s n ]. 
Every node of the tree will contain the following substructures: 

1. A plain representation of the sequence Y a . This sequence occupies n\J log n bits of space. 

2. A regular wavelet tree m W a over the sequence Y a . This wavelet tree will have depth \/log n. 
It can be used to answer to range predecessor queries over the sequence Y a in time 0(y/ log n). 

3. Exactly 2^ logn predecessor data structures that support predecessor (rank) queries in 0(log log n ) 
time. For each character, c € [1..2V log?l ], we store its positions of occurrence in Y a in a pre¬ 
decessor data structure denoted P( a ,c) • The predecessor data structures are implemented using 
Elias-Fano data structure in such a way that they occupy in total 0(n a y/ log n) bits of space. 

4. A range minimum query data structure on the sequence Y a denoted Rmin^. The data structure 
will occupy 0(n a ) bits and answer queries in constant tile. 

5. A range maximum query data structure on the sequence Y a denoted Rmax Q . This data structure 
will also occupy 0(n a ) bits and answer queries in constant tile. 

All data structures of a node a will use 0{n a y/ log n) bits of space (for predecessor data structures 
we count the space as if it was one single structure) except for range minimum and range maximum 
data structures, which will use 0{n a ) bits. A detailed description of the predecessor data structure 
is given in Appendix O The space for all nodes at the same level will sum up to 0(n-^/logn) and 
since we have y/logn levels, the total space will sum up to O(nlogn) bits. 
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We now define how the sequences Y' a and Y a are builtH. At any level i € [O.-yTogn — 1] we 
will have 2 l ^/ log71 nodes. For a € [O..2*v /logn _ the sequence Yf is built by first constructing the 
subsequence Y" of n a values in Y whose i\J log n most significant bits equal a (i.e. Y[l..iy/\ogn\ = 
a), and then removing the most significant i^/\ogn bits from every element in Y" (that is Y'\j\ = 
Y"\j\ [ii/log n + l..logn] for all j € [l..n a ]). Then Y a is obtained from Y' a by taking the most 
significant ydbgn bits from every element of Y' a (that is Y\j] = y , [j][l..- v /log n\ for all j € [l..n Q ]). 
Notice that for the root node we will have Yq = Y. The total number of nodes will be dominated 

by the nodes at the lowest level, summing up to 0{n/2^ logn ). 

With our data structure now defined, we will next show how to construct it efficiently. The 
description of how queries are answered is given in Appendix [Bj 

5.1 Construction of the range-predecessor data structure 

Building all subsequences Y a and wavelet trees W a can be done in time 0(ny/logn) using a variation 
of the algorithm shown in [sag. Details are shown in Appendix [A] The construction of Elias- 
Fano data structures can also easily be done in overall time 0(n^/logn). Details are shown in 
Appendix lC.il 

Each range minimum and maximum query data structure can be constructed in time 0(n a ) 
using the algorithm of Fischer and Heun [18] . When summed over all the nodes a, the construction 
and range minimum (maximum) data structures takes time O(nvdogn), since the total number of 
elements stored in all the structures is 0(n^/logn). We have thus proved the following theorem. 

Theorem 3. Given n points from the grid [1, n] 2 , we can in 0{ny/\ogn ) time build a data structure 
that occupies 0(n log n) bits of space and that answers range predecessor queries in 0(y/ log n log log n) 
time. The construction uses 0(n log n) bits of working space. 

We can generalize the data structure as follows. 

Theorem 4. Assume that we have available space N and preprocessing time 0{N) with word- 
length w > log N. Then given n points from the grid [1, n] 2 , with n < N and a parameter c> 2, we 
can in 0(n(log n/yj log N + c)) time build a data structure that occupies 0{cn log n) bits of space 
and that answers range predecessor queries in 0(c log^ 1 /^ n log log n) time. The construction of the 
data structure uses 0(n log n) bits of working space and a precomputed global table of size o(N ) 
that can be built in o(N) time. The precomputed table can be shared by many instances of the data 
structure. 

Proof. The proof is more involved. To prove the result we will use multiple levels of granularity. 
At the top level, we will have a tree of log 1 ^ n (tree) levels (to avoid confusion we call these tree 
levels ), where each node handles a subsequence of Y over alphabet [1..2 log(c 1)/Cn ] [)]. The level of 
granularity of this tree is log^ c_1 ^ c n. For each node the data structures are exactly the same as 
the ones in Theorem [4[ except that the wavelet tree of each sequence is replaced by tree at level 
of granularity log^ c ” 2 ^ c n, which contains log 1//c n (tree) levels, each of which handles a sequence 
over alphabet [1..2 log " n ]. The recursion continues in the same way until we get to trees at level 

3 Note that the sequence is not used explicitly in the data structure. It will however be used later, when we show 
how the data structure is queried and constructed. 

4 We assume without loss of generality that log 1 ^ n is integral 
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of granularity 1, which is implemented using a wavelet tree. Queries are answered in two phases. 
In the first we determine the longest common prefix between the y coordinate of the query and 
the y coordinate of the answer by traversing trees at decreasing levels of granularities (tree at 
level log^ c_1 ^ c n, then level of granularity log( c_2 ^ c n and so on) and querying the range-maximum 
and predecessor data structures at each traversed node. Then the remaining bits of the answer 
are determined by traversing trees of increased levels of granularity, querying range-maximum and 
predecessor data structures. The time bound 0(clog 1//c nloglogn), follows because we have c levels 
of granularity and at most 0(log 1//c n) nodes are traversed at each level of granularity, where queries 
at each node cost O (log log n) time. Details of how queries are answered are given in Appendix lB.il 
The main challenge is to quickly construct the Elias-Fano data structures. This is shown in 
Appendix lC.2l The construction for range minimum (maximum) or queries is shown in Appendix iDl 
Both construction methods make use of bit-level parallelism to accelerate processing. □ 

As an immediate corollary, we have the following: 

Corollary 1. Given n points from the grid [l,n] 2 , for any integer c > 2, we can in 0(ny/logn) 
time build a data structure that occupies 0(cn log n) bits of space and that answers range predecessor 
queries in 0{c log( n log log n) time. 

6 Rightmost Parsing 

We will now apply the construction for range predecessor detailed above to obtain a faster algorithm 
for computing rightmost previous occurrences of LZ phrases. An algorithm by Ferragina et al. [15] 
achieves 0(n(l + log a/ log log n)) time, but requires 0(n log n) bits of space that can not trivially 
be reduced. 

In this section, we will achieve time 0(n{ 1 + loger/ylog n)) and O(nloger) bits of space, sig¬ 
nificantly improving the bounds achieved by Ferragina et al. [15] . We first present a preliminary 
solution from m in Section IQ and then present an initial version of our solution that uses 
0(n log n) space (sections 16.2116.31 and 16.51) . The solution works by decomposing the phrases into 
3 categories. Finding the rightmost occurrences will be fast for different reasons. For the first two 
categories, the reason is that the number of phrases is small, while for the last the reason is that 
the rightmost occurrence for each individual phrase will be easier to find. 

We divide the full range [l..n] in the suffix array into blocks of a certain size B. Phrases longer 
than a certain length l are handled in Section 16.211 , phrases shorter than a certain length i whose 
suffix array ranges cross a block boundary are handled in Section 16.31 Finally the remaining phrases 
are handled in Section T6.41 

6.1 Basic solution of Ferragina et al. 

We present the basic solution of m- The original algorithm uses 0{na) time and 0(n log n) space. 
Here, we describe a slightly improved version that uses only 0(nlog <r) bits. The algorithm works 
as follows. To each phrase, we can associate a leaf and an internal node in the suffix tree. The leaf 
corresponds to the suffix (text position) that starts at the same position as the phrase and the 
internal node corresponds to the range of suffixes prefixed by the phrase. We mark in the suffix 
tree all the internal nodes that are associated with phrases. To each leaf, we associate the nearest 
marked ancestor. This requires 0{n ) bits and 0{n) time in total. Also, to each leaf associated with a 








phrase, we keep a pointer to the internal node that corresponds to that phrase. This association can 
be coded in 0(nlogo - ) bits, since we can use a bitvector to mark leaves together with an array of at 
most z = 0(n/\og a n) pointers, each of O(logn) bits. All the structures can be built in 0(n) time. 
We keep only the marked nodes in the suffix tree. To each marked node, we keep all the phrases 
that correspond to it. To each marked node a, we keep a text position p a that will point to the 
rightmost position among all the leaves that have node a as their nearest marked ancestor. Overall 
the space occupied by all data structures is 0(n log a) bits. The algorithm works by scanning the 
inverse suffix array in left-to-right order. That is, at every step i, we extract the suffix array position 
(suffix tree leaf) that points to text position i, and update the variable p a , where a is the nearest 
marked ancestor of the leaf. When we arrive at a leaf that corresponds to a phrase, we go to the 
corresponding node, and then explore the entire subtree under that node and take the maximum of 
all variables p a for all nodes a in the subtree. The scanning of the inverse suffix array, can be done 
by inverting the Burrows-Wheeler transform in increasing text order. This can be done using the 
select operation which can be answered in constant time. Thus the inversion takes time O(n). The 
overall space is 0(na). The time bound comes from the fact that each internal node corresponds 
to a phrase of length m, and can only have at most m ancestors. Since at most a phrases are 
associated with each of the ancestors, the node can only be explored am. times: a times for each of 
its m ancestors. Since the total length of the phrases is to, we conclude that the total running time 
is 0(ncr). We thus have the following theorem. 

Theorem 5 (space improved from 15|). We can find the right-most positions of all phrases 
in time 0(na) and working space 0(n log a) bits. 

6.2 Long factors 

Because factors do not overlap, there can only be 0{n/i) factors of length at least i. We thus can 
afford to use time O(i) to find the rightmost occurrence of each factor. We sample every rth position 
in the text (r and l will be set later). We then build a five-sided 3D range maximal query data 
structure as follows. We will have the text X[1 ..to] with split points at positions ir, (i + l)r .... We 
then store n/r points as follows. For every i E [l.m/r], we store point (x. y, i) where x represents the 
lexicographic rank of the reverse of substring X[(i— l)r+\..ir\ among all substrings X[(i—l)r+l, ir\, 
and y the rank of the suffix X[ir + 1..to]. A query will consist of a triplet ([aq, £ 2 ], [ 2 / 1 , 2 / 2 ], z) an d will 
return the point (x,y,i) with maximal coordinate i among all points that satisfy that x E [xi,xfi[, 
y E [y 1 , 2 / 2 ] and z < i. In this way we store to' = 0(n/r) points in total. We store the set S of 
reverse of all substrings X[(z — 1 )r +1 fir] for i E [l..n/r] in a table Tg sorted in lexicographic order. 
Given any string p, we can to determine the range of elements of S which have reverse of p as a 
prefix. The table T$ can be built in 0{(n/r)(\ogn + )) = 0(to(^^ + ]2|^)). Reverting every 

string of p can be done by using a lookup table LT which stores the reverse of every possible string 
of length logo- to/ 2. The space used by the lookup table will be 0(y/n\ogn) bits and will allow to 
revert every string of length log^TO/2 in constant time. 

The data structure we use occupies 0(n'log(n')) = 0(nlog 2 n/r) bits of space and answers 
queries in time 0(log 2 to' log log to'). This is obtained by building log to data structures for 2D range 
maximal queries m- By building a perfect binary search tree on the third dimension z, then 
building a 2D range maximum query data structure for all points that fall in one subtree (we use 
only coordinates x and y), one multiplies the space and the query time by factor log to'. Since the 
original data structure uses space O(to') words and answers in O(logn'loglogn') time, we obtain 
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the bounds above by multiplying the time and space bounds by logn'. By replacing n' by n/r, the 
total space usage is 0{n log 2 n/r) bits and the query time is 0(log 2 n log logn). 

Given a factor p of length at least £, we will issue r queries each of which will take 0( log 2 n log log n) 
time. The final result will be maximum over all the results of the queries. In order to determine the 
queries to the 3D range-max structure, we will binary search the table Tg for every suffix of p of 
length i € [l..r] (we first revert the suffix in time 0(r using the table LT). This will determine 
the range \x\,x?\. The ranges [yi, j/ 2 ] are determined by querying the BWT of X in total time 0{£) 
(by backward searching). 

Thus the total query time will be 0(r( log 2 n log log n + r 1 -^-) + £) and the space 0(n log 2 n/r). 
Choosing £ > log 5 n and r = log 2 n ensures that the total time per factor is 0(log 4 log log n+£) which 
amortizes to O ((log 4 log log n + l?)/log 5 n + 1) = 0(1) time per character of the factor. The total 
space is dominated by the space used by Tg which is 0(n logo - ) bits, and the total preprocessing 
time is dominated by the time needed to construct T 2 which is 0(n(^^ + |2|^)) = 0(n|^|^). 


6.3 Sparsified tree 

If we divide the universe x[l..n] into blocks of equal size B and moreover only solve queries for 
factors whose suffix array range crosses a boundary and whose phrase lengths is at most £, then 
the number of nodes considered can not be more than 0((n^)). To justify this, consider for every 
boundary the deepest node that crosses a specific boundary. Obviously this node is unique, since if 
two nodes cross the same boundary, one has to be parent of the other and then one of them would 
not be the deepest. Thus there can be not more than 0(n/B) such nodes. We call those nodes 
basic nodes. On the other hand, any node that crosses a boundary has to be ancestor of one of the 
basic nodes. Since, by definition a basic node can not have more than £ ancestors, we deduce that 
the total number of nodes is n' = 0((n-^)). Recall now that the algorithm described in Section T6.31 
traverses the tree of phrases and for each leaf updates the minimum of the nearest marked ancestor 
and then for each phrase computes the rightmost pointer by traversing the whole subtree under 
the node of that phrase. Since, there are at most a phrases per node, each of the n' nodes will be 
traversed 0(£a) times, at most a times for each of its (at most) £ ancestors. Thus, the total cost 
will be 0(n + n'£a ) = 0(n + n^-). Choosing B = £ 2 a ensures 0(n) overall running time. The total 
additional used space will be 0(n ) bits dominated by the space needed to store the nearest-marked 
ancestor information (see Section ED. 

6.4 Remaining factors 

We will use Theorem 0] for short factors that do not cross a block boundary. For each block we build 
a range-predecessor data structure. We can use parameter N = n and use a global precomputed 
table that adds o(n) bits of space. Since each block contains at most B points, construction takes 
0(B\ log B /ydog n]) time. Each query is solved in time 0((log B) 1//c log logn). Choosing B = £ 2 a 
means total construction time adds up to 0(n [log a/y/\ogn |). This dominates the total query 
time, which adds up to 0( , n log 1//c (£ 2 <r) log log n). Notice that the y coordinates in each block are 
originally in [l..n]. In addition to the range-predecessor structure, we will use a predecessor structure 
to reduce the y coordinate of a query to the interval [1..B], For that we assume that we have available 
all the values y coordinates of the points that fall in the block sorted in increasing order. We also 
assume that the y coordinates of the points stored in the range-predecessor data structure have 
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been reduced to the interval [1..,£?]. That is, instead of storing the original y coordinate of each, we 
store the rank of that coordinate among all values of y coordinates that appear in the block. 

6.5 Putting pieces together and getting optimal space 

Combining together the three categories above, we can get total time 0(n(l H— 1 ? B<T )) and space 

V Io e n 

0(n log n) bits. Details are shown in Appendix lF.il The space can be reduced to optimal 0(n logo - ) 
bits. This is shown in Appendix IF. 21 We thus have proved the following. 

Theorem 6. We can find the rightmost occurrences of Lempel-Ziv factors in time 0(?r(log log a + 
logu/yTogn)) and space O(nlogcr) bits. The time is 0(n( 1 + loga/y/logn)) if randomization is 
allowed. 

7 Conclusions and Open Problems 

We leave two main open problems. Firstly, is it possible to compute the rightmost parsing in 
0(n ) time, independent of the alphabet size? Note that even using 0(n log n) memory and 0(n) 
time would be interesting. The algorithm introduced in Section [6] is the current fastest running in 
0(n(l + log ofsj log n)) time (and using compact space). Secondly, are the time bounds we achieve, 
or anything o(n log n) for that matter, possible if processing must be online? 
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A Wavelet tree construction 


We first describe the core procedure used to build the wavelet tree and then describe the wavelet 
tree construction itself. We then describe how range-predecessor queries are solved using the wavelet 
tree. 

A.l Core procedure 

Suppose that we have are given a parameter N < 2 W bits, and that we can spend o(N) time 
preprocessing to build (universal) tables that occupy o(N ) bits of space. The core procedure to 
build the wavelet tree unnau is as follows. We are given an array of integers V[1 ..m] and a bit 
position p, where V[i] € [0, cr] for all i € [1 ..m\, and the goal is to produce two arrays Vo[l..mo] and 
Vi[l..mi] such that Vo is the subsequence of integers of V whose pth most significant bit equals 0 
and V\ is the subsequence of elements of V whose pth most significant bit equals 1. We will describe 
a procedure that runs in time 

In order to fill the vectors Vo and V\ we use two counters Co and Cl , initially set to 1. We scan 
the array V and read it in blocks of B = elements. Suppose that a block contains to elements 

whose bit number p equals 0 and t\ = B — to whose bit number p equals 1. We denote by b the 
block and by bo (resp. b\), the subsequence of elements of b whose bit number p equals 0 (resp. 1). 
We append the blocks bo (resp. b±) at the end of Vo (resp. V\) respectively at positions indicated 
by Co (resp. C\) and then set Co = Co + to (resp. C\ = C\ + 1\). 

We will use a lookup table L[l..a B ] that produces tuples (bo, to, b\, t\) for every possible block 
b of B elements of length logo - bits each. Since we have a B = 0(y/N) possible blocks and every 
element in the table uses O(logA) bits, the size of the lookup table will be 0(\/ATog A) bits. 

Reading a block b, reading the entry L[b], incrementing Co (Ci), and appending bo ( b \) to Vo 
(Vi), can all be done in constant time, since each of the blocks b,bo,b\ fit in log A/2 < w bits 
and reading or writing blocks of this size requires only a constant number of bit shifts and bitwise 
logical operations. 

A.2 Construction 

A wavelet tree for a sequence iS'[l..m] over alphabet a can be built in time 0(m ^ a ^ —I- cr) by 
repeating the core procedure for each node of the wavelet tree. 

More precisely at first level, we use the procedure to produce two arrays S'ofl-.mo] and Si[l..mi] 
such that So is the subsequence of integers of S whose most significant bit equals 0 and Si is the 
subsequence of elements of S whose most significant bit equals 1. Notice that the bitvector stored 
at the root can be trivially obtained from S in 0(m ) time, by just scanning S and extracting the 
most significant bit of every element of V and append it to the bitvector. This process can be 
accelerated to run in 0(m ^°^ ), by using again a lookup table that givens the sequence of B most 

significant bits for every possible blocks of B = 9 \°g^y blocks of characters. 

At the second level, we will apply the same algorithm to So (Si), to get the subsequence to 
produce two arrays Soo[l- m o] and £oi[l..mi] (£io[l--mo] and Sufi..mi]), such that Sbo^io) is 
the subsequence of integers of S whose second most significant bit equals 0 and Soi(Sn) is the 
subsequence of elements of S whose second most significant bit equals 1. At that point, we can 
generate the bitvector 6 0 (^ 2 ) that contains the second most significant bit of So (Si) in time 
0(|5b| /og)v 1) (0(|<Si| + 1)) and finally throw So (Si). The generation of those two bitvectors 
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(which are to be stored at the two children of the root of the tree) can also be done in total time 
+ 1). Once a bitvector has been generated we index it so as to support rank and select 

queries. This is done in times 0(|So|iog]7 +1) an d 0(|S'i|^^ + 1) respectively for Sq and S\ using 
the technique described in jT], which uses lookup tables of size o(N) bits. We continue applying the 
algorithm in the same way for every node at every level, until we get to a leaf of the wavelet tree. 

Since we have a nodes and log a levels, the total running time is 0(m^^-+cr). The total space 
used for the lookup tables is 0(y/N\og IVlogcr) bits and the total temporary space used during the 
construction is 0{m logo - ) bits, since only bitvectors are kept after a given level is constructed. 

We thus get the following lemma: 

Lemma 1. Given a sequence Y of length m over alphabet [l..cr] and global precomputed tables of 
total size o(N), where a < m < N < 2 W , we can build the wavelet tree over Y in time 0(m ^ iog 7 V + 
a), using 0(n log a) bits of temporary space. 

A.3 Range-predecessor queries using wavelet tree 

We now show how range-predecessor queries are solved using a wavelet tree. We are given integers 
x\,X 2 and y 2 and must find the point (x,y) such that x\ < x < X 2 , y <y 2 and v is maximal. We 
assume that there is no point (x, y 2 ) such that x € \xi,x 2 \. Otherwise, the answer to the query is 
trivial. The query proceeds in two phases. In the first phase, we find the longest common prefix 
between 1)2 and y, and in second phase, we determine the remaining bits of y. At this second phase, 
a bit number i is determined by traversing a node at level i and its value is the maximal value 
for which the query issued at that node gives a non-empty answer (queries and their answers are 
defined more precisely below). 

The first phase proceeds as follows. At the root level, we check whether there interval [aq, x’ 2 ] in 
the bitvector b stored at the root contains an occurrence of y[ 1], by checking that rank(6, y[l], aq — 
1) < rank(6, y[l], £ 2 ). If that is the case, we continue to child number y\ 1] of the root and recurse 
using the interval [x\ , x 2 \ = [rank(6, y[l], aq — 1) + 1, rank(6, y[l], x 2 )\. Let b y n 1 be the bitvector 
associated with child number y[ 1], At the next level, we use two rank queries on b y vo with symbol y[ 2] 
and points x\ and x 2 and check whether the interval [rank(6 2/ [ 1 ], y[2], x'i — 1) + 1, rank(6 J/ [ 1 ], y[2], x 2 )] 
is non-empty. We continue in the same way down the tree, until we reach a node a for which we 
have an empty interval. Let i be the level of that node. Suppose that y 2 [i\ = 1, then the answer is 
in the subtree of a, and we can deduce that the longest common prefix between y 2 and y of i — 1. 
If y 2 [i] = 0, then we go up the tree decrementing i at each step and at a node a at level i, check 
whether 7/2 [*] = 1, and if so requery the bitvector at that node with the interval with which we 
already queried it before, but this time with bit value 0. If the query is successful, we stop climbing 
the tree and we will have determined that the longest common prefix between y and y 2 is i — 1. 

We now describe the second phase of the query. The remaining log a — i + 1 bits of y can be 
completed by traversing down from node a. 

For that we continue by querying the bitvector at node a with the same interval which we 
already used for querying node a, but this time using bit value 0 instead of 1. 

We then continue to traverse down the tree, for each traversed node querying the bitvector for 
bit value 1. If the returned interval is non-empty, we continue traversing down the tree with the 
interval. Otherwise we query the for bit value 0 and continue traversing with the returned interval 
(which is necessarily non-empty). When we reach the leaf, we will have constructed the whole value 
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of coordinate y and will have a singleton interval [1,1]. In order to reconstruct the value of x, we 
climb up the tree retraversing (in reverse order) the nodes we already traversed, and for a node at 
level i issue a select query using the bit value y[i] for the single position that we obtained from the 
previous select query (or position 1 for the first select query). At this point we will have determined 
both coordinates x and y of the answer. 

B Range-predecessor query answering 

We now describe how we answer range predecessor queries with the data structure used in The¬ 
orem [3j The algorithm can be thought of as a generalization of the query algorithm used for 
the wavelet tree. We are given integers x±,X 2 and 7/2 and must find the point (x,y) such that 
xi < x < X 2 , y < 2/2 and y is maximal. A query will first proceed by traversing the tree top-down, 
where at level i a rank query at a node cr* will allow to determine the range in Y ai+1 from the range 
in Y Qt . where aq+i is the next node at level i + 1. The range minimum query data structure at 
level i will allow to determine whether the y coordinate of the answer shares at least i chunks with 
j/ 2 - Once it has been determined that the y coordinate shares a chunk of length t with 7 / 2 , then 
the value of the next chunk (chunk number i + 1) of y should be smaller than the corresponding 
chunk of 7/2 and then the next chunks will all need to have maximal values. Hence, we will use 
range maximum queries for at all next levels. With the y coordinate determined we can read the x 
coordinate from X[y]. 

We now give a detailed description of how the queries are solved. Before delving into the details, 
we first show how to handle the easy case in which the answer is a point (x,y) such that 2 / = 2 / 2 - 
To eliminate the case, it suffices to test that x\ < X [ 2 / 2 ] < X 2 , and if it is, then the answer is 
(X [ 2 / 2 ], 2 / 2 )- We now show how queries are answered under the assumption that X [ 2 / 2 ] ^ \xi,x?\. 

We traverse the tree top-down, successively for the root node, then the node oq = 2 / 2 [l..\/log n \i 
then the node 0:2 = 7/2[l--2\/log n \ and so on. For the root node, we first compute the value 
mo = min(y[xi..X 2 ]). Then query the predecessor data structure P(o, Ql ) to check whether Y[x 1 , X 2 ] 
contains the value oq. The predecessor data structure will then be able to return a pair of values 
(xqo, 2 - 2 , 0 ) where (xqo) (resp. (x 2 ,o)) is the leftmost (resp. rightmost) position in [xi,X 2 ] such that 
P[^i,o] = ot\ (resp. y[x 2)0 ] = ot 1 ). If the interval H[xi,X 2 ] does not contain the value oq, we stop at 
the first level. Otherwise, we go to the second level to node a 2 , compute mi = min(y[xi ) o..X 2 ,o])i 
and query the predecessor data structure P (rt2 y2 ^ 2v /iogT]) ^ or ^ ie P a ^ r ( x i.O) x 2,o) to check whether 

Y a 1 [xqo, X 2 ,o] contains the value 7/2 [Vlog n+1..2\/\og nj. If that is the case, then the predecessor data 
structure will return a pair (xiq, X 2 ,i) such that (xiq) (resp. (x 2 ,i)) is the leftmost (resp. rightmost) 
position in [xi,o,X2,o] with y[xi,i] = 2/2 [Vl°g n +1 • • 2 Vlog n\ (resp. T[x2,i] = ^[Vlogn+l.^Vlognj). 
We continue the traversal of the tree in the same way until the predecessor query fails at a certain 
level 

We let j be the deepest level such that rrij < 2 / 2 [(j — l)\J\ogn + \..j\J\ogn\. This tells us that 
the final value for y is prefixed by 2/2 [ 1 • - (J — l)\/log n \i but is followed by a chunk that differs from 
(more precisely, is strictly smaller than) 2 / 2 [(j — l)\J\ogn + l..jy/\og n\. 

Then we query the wavelet tree W a with a = otj to find the range predecessor of tj 2 [(j — 
\)^J\ogn + l..ji/Togn\ in the interval Y a [x ij--X 2 j]. This will produce the next y^Iogn bits of y 
(denoted by y') and an interval (xij+i, X 2 ,j+i). We then continue to the node a = 2/2 [1 - - (j — 

5 Note that one of the predecessor queries will have to fail, because we have eliminated the case that the y component 
of the query answer equals 7/2 - 
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l)\/Iog n\ ■ y', but this time the next bits of y will be produced by the range maximum query 
data structure over interval Y a [x ij+i-.a^q+i]- We continue traversing the tree in this way until the 
bottom node, at which point we will have induced the full value of y. To get x we simply read X[y\. 

B.l Faster queries 

We now show how queries are answered using the data structure from Theorem |4l The algorithm 
can be thought of as generalization of the query algorithm presented in beginning of Section [Bj 
Recall that we are given integers aq, a ?2 and y 2 and must find the point ( x , y) such that aq < x < X 2 , 
y < y 2 and y is maximal. As usual we eliminate the trivial case that aq < X[y 2 ] < X 2 , in which case 
the answer is (X[y?\, 2 / 2 )- As before the query proceeds in two phases. The first phase determines the 
longest common prefix between y and y 2 and the second phase allows us to determine the remaining 
bits of y. Finally, the value of x can be determined by reading X[y\. We now give details of the two 
phases. In the first phase, the longest common prefix between y and 7 / 2 , is determined in chunks 
of log( c_1 )/ c n bits by traversing the tree of granularity log( c_1 " c n. Then at most log( c-1 )/ c n — 1 
bits will remain to be determined, and we continue from a node in the tree at level of granularity 
log^ c_2 ^ c n, labelled by 1 ) 2 [(i — 1) log( c_1 )/ c n + 1, i log- c ~ 1 P c n], for some integer i. Since we have to 
determine less than log ^ c ~ l P c n bits, the number of traversed tree levels will be less than log^n. 
We then continue refining the length of longest common prefix by traversing trees at decreasing 
levels of granularity until we reach a node at level of granularity 1, which is an ordinary wavelet 
tree. As before at each node, we will use a range-minimum query to determine whether the label 
of the node is a prefix of the longest common prefix of y and y 2 and use the predecessor query to 
determine whether we continue exploring the next node at the next tree level and the interval to be 
used at that next node. In the second phase, we will traverse trees of increasing level of granularities, 
from the node at which the first phase has stopped. This time we will use range-maximum queries 
to determine both the next node to explore and the next chunk of bits of y (the chunk length being 
the level of granularity). We switch from a level of granularity \og d ^ c n to the next one of granularity 
log( rf+1 )/ c n, whenever the number of determined bits of y is multiple of log^ +1 ^ c n and the first 
node at that level will be the one labelled by the bits of y which have been determined so-far. As 
before the range to be used at next tree level will be determined using predecessor query on the 
range at current node, using the current chunk determined from the range-maximum query. It is 
easy to see that the query time of both phases is O^clog 1 ^ n log log n), since we have c levels of 
granularity and at each such level we traverse at most log^n nodes, spending O(loglogn) time at 
each node. This finishes the description and analysis of the queries. 

We traverse the tree top-down, successively for the root node, then the node aq = ?/ 2 [l..\/l 0 g nj, 
then the node 0,2 = 7/2 [1 --2^log and so on. For root node, we first compute the value mo = 
min(F[xi..X 2 ]). Then query the predecessor data structure P( 0 ,ai) to check whether Y{xi,X 2 ] con¬ 
tains the value aq. The predecessor data structure will then be able to return a pair of values 
(xqo, 2 : 2 , 0 ) such that (aq,o) (resp. (aq,o)) * s the leftmost (resp. rightmost) position in [aq,^] such 
that R[aq,o] = aq (resp. Y[x 2 , 0 ] = ai). If the interval Y[ aq, aq] does not contain the value aq, we stop 
at the first level. Otherwise, we go to the second level to node « 2 , compute m\ = min(y[aq,o..a: 2 ,o]) 
and query the predecessor data structure ^ ^ 2 ydogn]) t° r the P a h' (aq,o, aq,o) to check whether 

Y ai [x 1 , 0 ?^ 2 , 0 ] contains the value y 2 [y/\ogn + l.^ylogn]. If that is the case, then the predecessor 
data structure will return a pair (aq^, 2 : 2 , 1 ) such that (aqq) (resp. (aq,i)) is the leftmost (resp. 
rightmost) position in [aq,o, aq,o] such that y[aq,i] = 2/2 [Vlog n + l-.2\/lognj (resp. Y[x 2 , 1 ] = 
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y 2 [v^log n + 1..2Vlog nj). We continue the traversal of the tree in the same way until the predecessor 
query fails at a certain level Jl. 

C Elias-Fano based predecessor data structure 

We now show how Elias-Fano predecessor data structures are built. Suppose that we have a set 
S of n keys from interval [l..u] (where for simplicity u and n are powers of two). We can show 
a data structure that uses n(2 + log(u/n)) + o(n) bits of space and that allows us to answer to 
predecessor (rank) queries in O(loglogu) time as well as finding the key of rank i (select queries) in 
constant time. The Elias-Fano encoding is composed of two substructure. Let x\ < x 2 < ... < x n 
be the sequence of keys to be encoded with Xi € [l..u] for all i € [l..n]. The first substructure is an 
array A[l..n], where A[i] contains the least significant log(«/n) bits of X{. The second substructure 
is a bitvector V of length 2 n bits which contains the sequence (P 1 IOX 2 — x\... 0 Xn ~ Xn ~ 1 l, where 
x\ = Xi/(u/n). In other words, the bitvector V encodes the most significant logn bits of elements 
Xi. In order to support the select operation for position i (computing Xj), we can first go to A[i\ to 
retrieve the least significant log(it/n) bits of x* and then do a select query to find the location of the 
ith one in V. If that location is j then the value of the most signihcant log n bits of x are equal to 
the number of zeros before position j, which is j — i. Thus a select query can be answered in constant 
time. To answer to rank queries, we will build a y-fast trie predecessor data structure (42) on the 
set S' of n/ log 2 u keys xi, x log 2 u+1 ,..., x n . This data structure will occupy 0(n/ log u ) = o(n ) bits 
of space and allows to determine the prededecessor of a query key x in S' in O (log log u) time. This 
will allow to restrict the predecessor search in S' to a small interval of size n. The predecessor search 
can then be completed by a binary search over that set, by doing select queries. This binary search 
also takes O(loglogu) time. 

The data structure can be generalized as follows. Given a parameter v < u (again assume that 
v is a power of two), we can build a data structure that occupies v + n(l + log u/v) + o(v + n) bits 
of space and that answers to rank and select queries within the same amount of time. To implement 
the data structure, we will use a bitvector V of size n + v bits with n ones and v zeros (the ones 
and zeros are stored as was defined before except that x[ = Xi/(u/v)) and store in V the least 
significant (u/v) bits of each key. The query time bounds are preserved. 

Lemma 2. Given a set S C [l..u] with |5| = n and a number v < u, we can build a data structure 
which occupies v + n( 1 + log u/v) + o(v + n) (assuming n,v and u are powers of two) and answers 
to rank queries in time O(loglogu) and select queries in constant time. 

C.l Simple Construction 

We can now show the following lemma: 

Lemma 3. Given a sequence Y of length n over alphabet [l..er] (where a < n and both a and n 
are powers of two), we can build a predecessor data structures so that the data structure number c 
stores the positions of character c in the sequence Y such that: 

1. The total space occupied by all predecessor data structures is n( 2 + log a) + o(n) bits of space. 

6 note that one of the predecessor queries will have to fail, because we have eliminated the case that the y component 
of the query answer equals 2 / 2 . To see why notice that the last query is for node y a where a = 2 / 2 [1.. logn — W°gn] 
and the last query is for y' = y 2 [log n — -^log n + 1.. log n] 
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2. The total construction time of the data structures is 0(n ) time. 

3. A predecessor (rank,) query is answered in time O(loglogn) and a select query is answered in 
constant time 

Proof. We will build a Elias-Fano data structures (generalized as above) denoted E a for a € [l..cr]. 
For each data structure we set v = n/a. The space used by data structure a is u+n a (l+log(n/u)) + 
o(v + n a ) = v + n a { 1 + logo - ) + o(y + n a ). Since n a and v sum up to n, we get that the total space 
usage is n(2 + logo - ) + o(n) bits of space. The vectors V a and A a can be built easily in 0(n a ). For 
that, we can first build the sequence of positions P a from Y (initially the sequences are empty), by 
scanning Y and appending for each Y[i\ = c append position i to sequence P a . Then, building V a 
and A a is done by scanning P a and for each element writing its least significant logo - bits in A a 
and writing a number of zeros and a one in V a (the number of zeros is based on the log n — log a 
most significant bits). 

In order to support rank queries, for each E a containing at least log 2 n elements, we sample 
every log 2 nth occurrences of character a in Y storing the resulting positions in a y-fast trie Tr a . A 
predecessor query on E a is then solved by first querying Tr a , which will answer in time O(loglogn) 
and complete with binary search for an area of length at most log 2 n doing 0(log log n ) select queries 
on E a . The construction of Tr a clearly takes 0(1 + \E a \) and the space is clearly 0(\\E a \/log n\) 
bits of space. When added over all a £ [l..cr], the construction time for rank and select structures 
on V a sums up to 0(n) and for all Tr a sums up to 0(cr + n/ log 2 n ). This finishes the proof of the 
lemma. □ 

C.2 Bit-parallel Construction 

We now exploit the bitparallelism to show a faster construction, showing the following lemma: 

Lemma 4. Given a sequence Y of length n over alphabet [l..cr] (where a and n are both powers 
of two) and a global precomputed table of size N such that a < n < N < 2 W , we can build a 
predecessor data structures so that the data structure number c stores the positions of character c 
in the sequence Y such that: 

1. The total space occupied by all predecessor data structures is n(2 + logcr) + o(n ) bits of space. 

2. The total construction time of the data structures is 0( n ^°^ —I- a). 

3. The temporary space used during the construction is 0(n logcr) bits. 

4- A predecessor (rank,) query is answered in time O(loglogiV) and a select query is answered in 
constant time 

The global precomputed table can be shared by many instances of the data structures. 

Proof. We will build cr Elias-Fano data structures (generalized as above) denoted E a for a € [1. .cr]. 
For each data structure we set v = n/a. The space used by data structure a is v+n a (l+\og(n/v)) + 
o(y + n a ) = v + n a ( 1 + logcr) + o(v + n a ). Since n a and v sum up to n, we get that the total space 
usage is n(2 +log cr) + o(n) bits of space. As in Lemma [3] we will build a Elias-Fano data structures 
(generalized as above) denoted E a for a € [1..cr], in which we set we set v = n/a for each data 
structure. As shown above the total space usage of the data structures sums up to n(2+log a)+o(n). 

We now describe the construction procedure. The construction proceeds in logcr phases. We 
describe the first phase, the subsequent phases will be (almost) identical, but will have different 
input and output. Before doing the first phase, we construct a vector A[l..n] such that A[i] = 
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i mod logo - and a bitvector V = (01 logfT ) ri//logo '. Notice that these two vectors represent together 
the Elias-Fano representation of the sequence 1, 2,... n. The phase will have as input the arrays A, 
V and the sequence T and will output a pair of vectors (Ao, Ai), a pair of bitvectors (Vo, Vi), and 
a pair of sequences (To, Y\). The sequence To (resp. Y ]) will store the subsequence of characters 
from Y which belong to the first (resp. second) half of the alphabet. The array A$ (resp. A\) will 
store the values from A whose corresponding positions in Y belong to first (resp. second) half of 
the alphabet. Finally the bits of V are copied into Vo and V±. More precisely every 0 in V is copied 
to both Vo both Vi while every 1 is copied to either Vo or Vi. The vector V is scanned right-to-left 
and the ith 1 in V is in correspondence with the zth element of Y. If Y[i] belongs to the first (resp. 
second) half of the alphabet (this can be checked by looking at most significant bit of Y [i]). then 
it is appended to Vo (resp. Vi). Whenever a 0 is encountered in V, it is appended to both Vo and 
Vi. One can now easily see that Aq and Vo (resp. A\ and Vi) is the Elias-Fano representation of 
the occurrences of characters from the first (resp. second) half of the alphabet in the sequence Y. 
The first phase can thus easily construct the output by simultaneously scanning A, Y and V in 
left-to-right order and appending at the end of Aq, Ai, Vo, Vi, To and Y\. In order to efficiently 
implement the first phase, we will make use of the four Russian technique. We read A, V and T 
into consecutive blocks of b = [log N/3 logo - ] consecutive elements, which occupy b(l + 2 log a) bits 
of space. We use a lookup table which for each combination of 3 blocks from A, V and Y, will 
indicate the blocks to be appended at the end of Aq, A\, Vo, Vi, To as well as by how much we 
advance the scanning pointers into A and T (which will be the number of Is in the block read from 
V ). The information stored for every combination easily fits into O(logn) = 0(log N) bits of space 
and the total number of possible combinations is 0(2 f, ( 1 + 21o s°')) = 0(2 2logN ^ 3+log < r N ) and thus the 
lookup table occupies 0(2 21ogAr / 3+logCT N log N ) bits of space. In the second phase, we build a pair of 
vectors (AxbAn) (resp. (Aio, An)), a pair of bitvectors (Voo,Voi) (resp. (Vio,Tn)), and a pair of 
sequences (loo, Voi) (resp. (Tio,Tn)) based on Aq, Vq and To (resp. A\, Vi and Ti). The procedure 
is similar to the one used in the first phase, except that now the distribution of elements in the 
output is based on the second most significant bit of elements of To (resp. Y\). The lookup table 
for the second occupies the same space as the one built in the first phase. At the end of the second 
phase the pairs (Aoo, Too), (Aoo, Too), (An, Voi), (Aio, Tn) will represent Elias-Fano representations 
of the occurrences of characters from respective subalphabets [l..cr/4], [<t/4+1..<t/ 2], [<r/2-t-1..3cr/4] 
and [3<r/4 + l..cr] in the original sequence T. We continue in the same way doing logrr — 2 more 
phase, where at phase i we will have built Elias-Fano for sub-alphabets of size a/ 2*. At the end we 
will get the Elias-Fano representation of the occurrences of each distinct character in the sequence 
T. We now analyze the time and space used by the algorithm. The total space used by all the 
O(logcr) lookup tables will be 0(2 21ogAr / 3+log ^ N log A log a) = o(N ) bits of space. For the running 
time it is easy to see that a phase i runs in time 0(2* + n/b) time, since we have 2* sub-alphabets 
and we process b elements of each processed vector in constant time. Over all log a phases the total 
running time sums up to 0(a + n\oga/b) = 0(a + ). Let E a = (V a ,A a ) be the Elias-Fano 

representation for occurrences of character a € [l..<r]. In order to complete the construction, we 
need to construct the support for rank and select queries on each vector V a . This can be done 
in time 0(1 + |T*|/log A) for the bitvector \V a \, by using the construction in [T]. This allows us 
to support select queries on E a in constant time. In order to support rank queries, for each E a 
containing at leat log 2 N elements, we sample every log 2 Ath occurrences of character a in T by 
doing select queries on E a for positions 1, log 2 N , 2 log 2 N,... and store the resulting positions in a 
y-fast trie Tr a . A predecessor query on E a is then solved by first querying Tr Q which will answer 
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in time O(loglogn) and complete with binary search for an area of length at most log 2 N doing 
0(loglog N) select queries on E a . The construction of Tr Q clearly takes 0(1 + \E a \/ log N) and the 
space is clearly 0(|J-E' Q |/log./Vj) bits of space. When added over all a € [l..<r], the construction 
time for rank and select structures on V a sums up to 0(cr + n/log N) and for all Tr Q sums up to 
0{a + n/ log 2 N). This finishes the proof of the lemma. □ 

D Sampled range-minimum queries 

We first start by constructing a sampled sequence S' from the sequence S we want to index. Initially 
S' is empty. Suppose that we allow o(N ) precomputation of a table of size o(N) bits that can be 
shared by all instances of the range minimum or maximum queries. We assume that N > a. We 
divide S into blocks of size b = elements each, then scan it in left-to-right order, compute 

minimum element in each block and append the result at the end of S'. We then build a rmq R' 
on top of S' in time ^ ]). To compute the minimum in each block, we use a lookup table 

that returns the minimum element on all possible blocks of size b. The table stores 0(a b ) elements 
each of length log a bits, for a total space of 0(a b log a) = o(N) bits. 

To answer a rmq query for S[i..j], we first check whether the query is contained in one or two 
blocks of S. If that is the case, then we read the blocks compute the minimum in each block in 
constant time, and compute the minimum of all the 2 block also in constant time. For that we 
use a precomputed table that will contain 0(a b log 2 b ) elements each using log b bits. The table 
contains the answers to all possible queries (0(log 2 b)) over all possible blocks ( a b ). The space is 
0(c x b log 2 b log a) = o(N ) bits. If the query spans more than two blocks, then we can divide it into 
three parts. A head and tails parts which are contained in one block each and a central part, whose 
length is multiple of a block length. We answer compute the minimum on head and tail in constant 
time, using the precomputed table and compute the minimum on the central part using R', and 
finally take the minimum of the three. 

We thus have proved the following lemma: 

Lemma 5. Given a sequence Y of length n over alphabet [1..<x] (where u <n and both logu and 
n are powers of two) and global precomputed tables total of size o(N), where N < 2 W , we can build 
a range min data structure on top ofY: 

1. The space used by the data structure is 0(n/ log^ N) bits. 

2. A query is answered in constant time. 

3. The data structure can be constructed in time 0(n/ log CT N). 

4■ The temporary space used during the construction is 0(n/ log^ N) bits. 

The precomputed tables can be shared by many instances of the data structure and can be built in 
o(N ) time. 

E Construction of the fast range-predecessor data structure 

In this section, we show the construction of the data structure used in Theorem [4j The main 
ingredients are lemmas [5] and 0J We let L\ = log l ^ c n be the level of granularity such that log*/ c n > 
VlogiV and L 2 = log^ -1 ^ c n < \/log N. For level L\ and higher, we will use simple linear time 
algorithms to build the predecessor and range minimum (maximum) data structures. The time for 
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these levels is dominated by the time used for level L\ for which the total number of elements stored 
in all sequences (and hence in predecessor and range minimum or maximum data structures) at all 
nodes will be 0{n\ogn/L\) = 0(n-^==). The sum of alphabet sizes at all nodes is clearly O(n). 

Hence the construction time at that level will be 0(n(l + —j)p=)). For the next higher level the 
construction time will be 0(n( 1 + -— ]/V log ^)—= =)). Continuing the same way we deduce that the 
construction time for these levels will be 0(n(ci -\— 1 , ug " )), where ci is the number of levels. 

For level L 2 and lower we will use bitparallel construction algorithm. For an given node a 
containing n a elements, the construction time will be 0(a + n a ) (which is dominated by the 

time for construction the Elias-Fano predecessor data structure), where a = 2 L2 = 0(2V /lo sThe 
total sum of the a term over all nodes is t\ = 0(n ) and the total sum of the terms n a ^ will 
be t ‘2 = 0(-jr l - ■ ), since the total number of elements stored in all nodes is Oin/L^). Thus, we 

have t ‘2 = 0 ( — )• Since we have L 2 < \AogW, we conclude that t 2 = 0( We notice 

that f ,2 will decrease by factor log 1//c n each time we go to the next smaller level of granularity. 

The term t\ remains stable. Since, the term £2 of level L 2 will dominate the terms £2 of all lower 

levels, and the term t\ = 0(n ) remains stable over the lower levels, we conclude that the total 

construction time for level L 2 and lower levels will be 0 (n(c 2 H— 1 , ogn )), where Co is the number 

y/\ogN 

of levels. Summing up the construction times of all levels we get the construction time stated in 
Theorem [JJ 


F Rightmost Parsing 
F.l The full picture 

We now present the complete algorithm. We first categorize each query. Queries that fall in the same 
block are redirected to the range-predecessor data structure responsible for handling that block and 
queries that are longer than t are put aside to be solved by the data structure for handling long 
factors. For the remaining factors (short factors with ranges that cross block boundaries) we will 
construct the tree of queries. This can be done in 0{n ) time, by sorting the query ranges first by 
starting positions and then by the negation of their ending positions. The sorting can be done in 
0{y/n + n/ log CT n) time using radix-sort. 

The rest of the algorithm is straightforward, except for few details on how the points stored in 
the range-predecessor data structure are constructed. This is done as follows. The generated points 
are pairs consisting of a suffix array position (x coordinate) and text position (y coordinate). We first 
notice that we need to divide the points according to their x coordinates, such that the points whose 
x coordinate is in [1..-B] go to the first range-predecessor data structure, points with coordinate 
in \B + 1,2 B] go to second data structure and so on. The points are generated by inverting the 
Burrows-Wheeler transform in left-to-right order. This generates the values of y coordinates that 
appear in each block in sorted order. Finally, for each block, we need as input to building its range- 
predecessor data structure, the points sorted by x coordinate and for each point the y coordinate 
replaced by the rank of the value among all values of y that appear inside the block. Since we 
extract the points by increasing value of y, we can keep an array C[l..n/B] that stores the number 
of points that have been extracted so far for each block. The counter is incremented each time 
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we extract a point, and the rank of the y value of a point that has been just extracted in block i 
will equal the counter C[i\. Finally we need to sort all the extracted points by their x coordinates, 
and this done via radix-sort. The algorithm uses O(nlogn) bits and 0(n(l + logu/Vlogn)) time 
overall @- 

F.2 Optimal space 

It remains to show how to reduce the space to the optimal 0{n\oga) bits. The additional space 
used for long factors (Section 16.21) and short factors crossing block boundaries (Section 16.31) is O(n) 
bits. The bottleneck is for the factors with small range ( see Section f6.4p in which the used space 
is 0(n log n). In what follows, we show how to reduce the space to 0(n logo - ) bits. 

To this end we divide the range of y into log n/2 equal sub-ranges. We now build the tree of 
all queries that do not cross a block boundary in the same way we did for the short factors that 
cross a block boundary in Section 16.31 As before, for every leaf of the suffix tree, we associate the 
nearest ancestor that corresponds to a query range. Additionally for each node in the query tree, 
we associate a bitmap B a [l.. log n/2] of size log n/2 bits. The query tree occupies 0(n log a) bits. 
We now invert the BWT and, for each leaf that has been extracted, we mark the bit B a [i], where 
i is the block in which the text position falls and a is the internal node pointed to by the leaf. 
We finally do a complete postorder traversal of the tree so that the bitmap of each node is bitwise 
OR’d with all the bitmaps of its children. At the end, for every query range [la,i"a\ corresponding 
to a node a, B a will mark every subrange of y where there exists a point (a, b ) such that a € [l a , r Q ] 
and b belongs to the subrange of y. 

We now build log n/2 query trees (henceforth local query trees). For every node of the tree 
(henceforth main query tree), we know that all range-predecessor queries will have the same interval 
[Ion To] on the x axis, but a different b on the y axis. 

For every query, the answer can only be in two sub-ranges: the sub-range Rb that contains b if 
B a [Rb} is marked or the largest R' b < Rb such B a [R' b ] contains a marked bit. The log n/2 local query 
trees are built via a preorder traversal of the main query tree and for every query determining the 
one or two target local query trees to which the query should be copied. That is a query will be 
copied to the local query trees Rb (if B a [Rb] is marked) and R' b (If R b exists). 

The nodes of the query trees (and the queries attached to them) are built on-the-fly during the 
traversal of the main query tree, by keeping a stack for every query tree. For every query in a local 
query tree, we keep a pointer to the originating query in the main query tree. Note that the total 
space used by all the local query trees remains bounded by 0(n log cr) bits, which is the same as 
the main query tree. 

We will now apply the algorithm described in Section T6.41 on every subrange of y. In more detail, 
we invert the BWT left-to-right. We do this in log n/2 phases, where in phase i we will compute all 
the positions in SA that point to text positions in [(i — 1 )n/C + 1 ..iC] with C = 2n/ log n. That is, 
we output all pairs ( x , SA[x]), such that SA[x] € [(i — 1 )n/C + 1 .4(7]. During the course of phase i, 
we apply the algorithm of Section 16.41 verbatim, except that now the points are only the one with 
y € [(i — 1 )n/C + 1.4(7] and the queries are from the ith local query tree and not the main query 
tree. 

7 Notice the time is deterministic and not randomized. The source of randomization is the construction and index¬ 
ation of the BWT, which is subsequently inverted. However this is not needed anymore, since we can build the 
suffix array of X in deterministic linear time and 0{n log n) bits of space. 


23 




It is easy to see that the time complexity stays exactly as it was before. The construction time of 
the range-predecessor and predecessor data structures for every phase will now be 0((n/ log ra)(log <r/\/Iogn)). 
In particular the radix sort done on the x values can now be executed in 0(y / n + n/logn) time and 
using 0(y/n) words of space. The space usage of the range-predecessor and predecessor structures 
over all phases will be just 0(n ) bits. Both structures are destroyed at the end of each phase, so 
that the total peak space usage of the whole algorithm is 0(n log a) bits. 
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